DSRM Report: The PAMAP2 Physical Activity Monitoring dataset

RegNo: 22116789

1 Introduction

1.1 Background

The "UK Chief Medical Officers' Physical Activity Guidelines" recommend physical activity levels, including duration and intensity, for different age groups to achieve optimal health. Physical activities are divided into 5 intensity categories: Sedentary, Light, Moderate, Vigorous, and Very vigorous.

1.2 Data Description

The dataset contains data from 18 different physical activities (such as walking, cycling, and playing soccer) performed by 9 subjects wearing 3 inertial measurement units (IMUs) and a heart rate monitor. Synchronized and labelled raw data from all the sensors (3 IMUs and the HR-monitor) is merged into one text file (.dat) per subject per session (protocol or optional). Each row represents one reading and contains 54 attributes (including timestamp, activity ID, heart rate and IMU sensory data).

Each of the data files contains 54 columns per row; the columns hold the following data:

The IMU sensory data contains the following columns:

1.3 Problem Statement

To develop software that can determine the amount and type of physical activity carried out by an individual, we need to analyse the dataset to answer these questions:

The results can help build software that distinguishes the intensity and duration of the activities an individual is doing, sends daily/weekly reminders to do more of a specific category, and sends motivation and encouragement when goals are exceeded.

2 Data Exploration

2.1 Load Data
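A minimal sketch of loading one subject's file with pandas, assuming the whitespace-separated, headerless .dat layout described in section 1.2 (a tiny synthetic two-row sample stands in for a real per-subject file):

```python
import io
import pandas as pd

# PAMAP2 .dat files are whitespace-separated with no header row; a tiny
# synthetic two-row sample stands in for a real per-subject file here.
sample = "0.01 1 104.0 30.0\n0.02 1 NaN 30.1\n"
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
print(df.shape)  # (2, 4)
```

For the real files, `io.StringIO(sample)` would be replaced by the path to each subject's .dat file.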

2.2 Rename Columns

According to the data description there are 3 IMU sensors, and each of them has 3 columns to describe its measurements (x, y, z). The orientation is described by four columns (x, y, z, w). A 'subjectID' column is added to distinguish between the 9 subjects.
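Rather than typing 54 names by hand, the column names can be generated programmatically. A sketch assuming a naming scheme like `hand_acceleration16_x` (the exact names are a choice, not prescribed by the dataset):

```python
# Generate the 54 column names: 3 shared columns plus 17 per IMU
# (1 temperature + 3x3 acc16/acc6/gyro + 3 magnetometer + 4 orientation).
imu_parts = ["hand", "chest", "ankle"]
sensors = (["temperature"]
           + [f"acceleration16_{a}" for a in "xyz"]
           + [f"acceleration6_{a}" for a in "xyz"]
           + [f"gyroscope_{a}" for a in "xyz"]
           + [f"magnetometer_{a}" for a in "xyz"]
           + [f"orientation_{a}" for a in "xyzw"])
columns = ["timestamp", "activityID", "heart_rate"]
for part in imu_parts:
    columns += [f"{part}_{s}" for s in sensors]
print(len(columns))  # 54
```

The generated list can then be assigned with `df.columns = columns`.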

2.3 Mapping Activities

Activity IDs are replaced by their names to simplify the labelling of the visualisations.
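A sketch of the mapping with `Series.map`, using a partial activity-ID map (the full PAMAP2 map covers 18 activities, with 0 marking transient "other" periods):

```python
import pandas as pd

# Partial activity-ID map (the full PAMAP2 map covers 18 activities,
# with 0 marking transient "other" periods).
activity_map = {0: "other", 1: "lying", 2: "sitting", 3: "standing",
                4: "walking", 5: "running", 6: "cycling", 24: "rope_jumping"}
df = pd.DataFrame({"activityID": [1, 4, 24]})
df["activity"] = df["activityID"].map(activity_map)
print(df["activity"].tolist())  # ['lying', 'walking', 'rope_jumping']
```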

Check for duplicates

Concatenate the two DataFrames to make the analysis easier.
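Both steps can be sketched as follows, with small hypothetical frames standing in for the per-subject protocol and optional data:

```python
import pandas as pd

# Hypothetical protocol/optional frames standing in for the real sessions.
protocol = pd.DataFrame({"timestamp": [0.01, 0.02], "activityID": [1, 1]})
optional = pd.DataFrame({"timestamp": [0.01, 0.03], "activityID": [10, 10]})

# Check for duplicated rows before merging.
assert not protocol.duplicated().any()

# Stack the two sessions into one frame for the analysis.
combined = pd.concat([protocol, optional], ignore_index=True)
print(combined.shape)  # (4, 2)
```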

3 Data Cleaning

3.1 Missing Values

To choose the right method for handling the NaN values, we need to understand the percentage of missing values in every column.

heart_rate has around 90% NaNs: the IMUs have a sampling frequency of 100 Hz (0.01 s between readings) while the HR-monitor runs at 9 Hz (~0.11 s between readings), so it will be handled in the step of creating the new time-window features.
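The per-column NaN percentage can be computed in one line; a sketch with a synthetic frame reproducing the ~90% heart_rate sparsity:

```python
import numpy as np
import pandas as pd

# Percentage of missing values per column; heart_rate is ~90% NaN
# because the HR-monitor samples far less often than the IMUs.
df = pd.DataFrame({"heart_rate": [104.0] + [np.nan] * 9,
                   "hand_temperature": [30.0] * 10})
nan_pct = df.isna().mean() * 100
print(nan_pct["heart_rate"])  # 90.0
```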

3.2 Dropping Columns

The data file description advises using the acceleration_16 data, which is why acceleration_6 is dropped. The same applies to the orientation data, because it is invalid in this data collection.
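A sketch of the drop, selecting columns by substring (the column names are assumptions following the renaming step in section 2.2):

```python
import pandas as pd

# Drop the ±6g accelerometer and the (invalid) orientation columns.
df = pd.DataFrame(columns=["timestamp", "hand_acceleration16_x",
                           "hand_acceleration6_x", "hand_orientation_x"])
drop_cols = [c for c in df.columns
             if "acceleration6" in c or "orientation" in c]
df = df.drop(columns=drop_cols)
print(list(df.columns))  # ['timestamp', 'hand_acceleration16_x']
```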

4 Data Visualisation

4.1 For the same person can we distinguish the difference between the activities?

To answer this question, the activities done by subject06 will be plotted across the whole available time period.

The answer to our question is yes, but the previous plots show that if we classify every point (record) separately, it will be hard for a model to distinguish between the activities. If we instead look at periodic time windows, we see that the more vigorous the activity, the higher the variance within each window. Hence new aggregated features need to be created to help the model. We can also see that the optional data overlaps with the protocol data without contradicting it, which means we can use both. For example, while the protocol data is labelled lying and sitting, the optional data is labelled computer work for the same time period. Standard deviation is the key to dividing the activities into the 5 types.

4.2 Do people have similar measurements while doing the same activity?

To answer this question, one sensor's data will be plotted for all the subjectIDs.

We can conclude from these figures that the change in measurement variability happens across all subjects, which means our model can classify new users' data without depending on any history about them. That said, since the subjects perform the same activity with slightly different intensities, integrating our model with another model that has features about each subject's history would boost the overall prediction accuracy. There are also some spikes that need to be deleted because they might confuse the model. For example, subject03 has two time windows with high variability yet a 'lying' label; deleting these spikes avoids confusing the model between 'lying' and 'running' during training.

4.3 Are the three axes of the sensor following the same behaviour?

To answer this question, the chest_gyroscope data for subject0 will be plotted across the 3 axes during 'lying' and 'rope_jumping'.

From the previous plots we can conclude that all the axes follow the same behaviour (high variability in measurements for intense activities and vice versa), which means we can apply the same pre-processing to all of them. The difference will be in the maximum deviation threshold for assigning a specific label versus treating a point as a spike (noise).

4.4 Distribution of activities across all subjects

To decide whether to train on all subjects or to hold one out for testing only (to mimic the real-life case), the number of available activity records needs to be plotted for every subject.

All of subject09's data will be held out for the testing phase to evaluate the model's performance on new users; its records are 5% of the total.

4.5 Correlation and Multicollinearity

To try logistic regression (a linear model) in the modelling stage, we need to check the correlation between the features in order to avoid multicollinearity.

The features are only moderately correlated, so no features will be deleted.
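A sketch of the multicollinearity check on synthetic features, flagging any off-diagonal absolute correlation above 0.9 (the 0.9 cut-off is an illustrative choice):

```python
import numpy as np
import pandas as pd

# Multicollinearity check: flag any off-diagonal |r| > 0.9.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"f1": x,
                   "f2": 0.5 * x + rng.normal(size=200),  # moderately correlated with f1
                   "f3": rng.normal(size=200)})
corr = df.corr().abs()
high = (corr > 0.9) & (corr < 1.0)   # exclude the diagonal
print(high.any().any())  # False, so no feature is dropped
```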

4.6 Summary

Steps we need to take in the pre-processing, based on the visualisation process:

5 Data Pre-Processing

5.1 Aggregated windows

The "Subject Cross Validation in Human Activity Recognition" paper concluded that overlapping windows are not worth their extra computational cost for Human Activity Recognition systems. That is why subject activities were grouped into windows of 100 non-overlapping observations. Since the timestamp is based on the IMUs, which have a sampling frequency of 100 Hz, each window covers 1 s. For the HR-monitor, with a sampling frequency of 9 Hz, there will be roughly 9 non-NaN observations within each window.
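The windowing can be sketched with a groupby on an integer window index; synthetic data stands in for the real columns, and the sparse heart_rate still aggregates because `mean` skips NaNs:

```python
import numpy as np
import pandas as pd

n = 300
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "hand_acceleration16_x": rng.normal(size=n),
    "heart_rate": [100.0 if i % 11 == 0 else np.nan for i in range(n)],
})
# Integer index of the 100-observation (1 s) non-overlapping window.
df["window"] = np.arange(n) // 100
agg = df.groupby("window").agg(
    acc_mean=("hand_acceleration16_x", "mean"),
    acc_std=("hand_acceleration16_x", "std"),
    hr_mean=("heart_rate", "mean"),  # mean skips NaNs, so sparse HR still aggregates
)
print(agg.shape)  # (3, 3)
```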

Holding out subject09's data, based on the analysis in section 4.4, to evaluate the model on completely new subjects.

5.2 Creating target

Creating the 5 types of physical activity: Sedentary, Light, Moderate, Vigorous, and Very vigorous, which we referred to in the background section.
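An illustrative sketch of the target creation, mapping activity names to intensity classes 0-4 (the exact activity-to-class assignment used in the report is an assumption here):

```python
# Illustrative mapping from activity names to the 5 intensity classes
# (0 = Sedentary ... 4 = Very vigorous); the exact assignment is an assumption.
labels = ["Sedentary", "Light", "Moderate", "Vigorous", "Very vigorous"]
intensity = {"lying": 0, "sitting": 0, "computer_work": 0,
             "standing": 1, "ironing": 1,
             "walking": 2, "cycling": 2,
             "running": 3, "playing_soccer": 3,
             "rope_jumping": 4}
print(labels[intensity["rope_jumping"]])  # Very vigorous
```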

5.3 Split Data

The data needs to be split before creating new features to avoid data leakage and misleading evaluation metrics. As explained, subject09 will be held out as part of the test data to mimic the evaluation on a new user.
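A sketch of the split, holding out subjectID 9 first and then splitting the remaining rows (frame contents and the 25% split fraction are illustrative):

```python
import pandas as pd

# Hold out all of subjectID 9 first (new-user test set), then split the
# remaining windows; column names and sizes are illustrative.
df = pd.DataFrame({"subjectID": [1, 1, 2, 2, 9, 9],
                   "target":    [0, 1, 0, 1, 0, 1]})
new_user_test = df[df["subjectID"] == 9]
rest = df[df["subjectID"] != 9]
test = rest.sample(frac=0.25, random_state=42)   # 25% held-out split
train = rest.drop(test.index)
print(len(new_user_test), len(train), len(test))  # 2 3 1
```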

6 Hypothesis testing

We assumed from the previous visualisations that the standard deviation of the measurements changes according to the activity. To test this hypothesis we will look at the std for activity 0 (Sedentary) and activity 4 (Very vigorous).
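A sketch of such a test using a two-sample Welch t-test from scipy, with synthetic window features standing in for the real hand_temperature_std values:

```python
import numpy as np
from scipy import stats

# H0: the mean of the per-window std is the same for Sedentary (class 0)
# and Very vigorous (class 4). Synthetic values stand in for real features.
rng = np.random.default_rng(0)
sedentary_std = rng.normal(0.05, 0.01, size=50)  # low-variability windows
vigorous_std = rng.normal(0.80, 0.20, size=50)   # high-variability windows
t, p = stats.ttest_ind(sedentary_std, vigorous_std, equal_var=False)
print(p < 0.05)  # True: reject the null hypothesis
```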

6.1 Splitting Data

6.2 Hypothesis Conclusion

The null hypothesis was rejected on both the training and the testing data sets. We can conclude that the model will have features it can use to distinguish between the activities, and that there is a significant difference in hand_temperature_std between Sedentary and Very vigorous activities.

7 Modeling

The problem is a classification one, and there is a list of models that can be used, including logistic regression, SVM, decision trees, etc.

Decision trees are fast in terms of performance, and they are robust to outliers to some extent. Their most obvious drawback was interpretability, which is addressed by the SHAP graphs, from which we can tell how the splits are made and whether a feature has a positive or a negative relation with the target.

XGB is based on decision trees, yet it uses ensembling to boost the classifier results by adding more weight at each step to every misclassified point. It also makes it easy to control the depth of each tree and to set an information-gain threshold for every split.

7.1 Implementing the algorithm
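A minimal sketch of the modelling step. The report uses XGBoost; sklearn's `GradientBoostingClassifier` is used here as a dependency-light stand-in with analogous knobs (`max_depth` per tree, `learning_rate` per boosting step), and synthetic data replaces the real window features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for XGBClassifier: gradient-boosted trees with controlled
# tree depth and learning rate, trained on a synthetic 2-class task.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic target
model = GradientBoostingClassifier(max_depth=3, learning_rate=0.1,
                                   n_estimators=50, random_state=0)
model.fit(X, y)
print(model.score(X, y) > 0.9)  # True on this easy synthetic task
```

With the xgboost package installed, `XGBClassifier` would slot in with the same fit/score interface.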

7.2 Feature importance
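A sketch of ranking features by a fitted tree ensemble's impurity-based importances (SHAP values, as mentioned in section 7, give a richer per-prediction view); the data is synthetic, with only the first feature informative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Only feature 0 is informative, so it should dominate the importances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = np.argsort(model.feature_importances_)[::-1]  # best feature first
print(ranked[0])  # 0: the informative feature dominates
```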

7.3 Validation

The results look too good to be true, so some investigation steps follow.

7.4 Testing

8 Conclusion

8.1 Technical Improvements

8.2 Actionable Plan